fix: respect CPU affinity (SLURM cgroups) when auto-detecting num_threads#236
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a helper function _available_cpus to accurately determine the number of available CPUs by respecting process affinity on Linux, falling back to the total CPU count on other platforms. The MetricsEvaluator class was updated to utilize this function when num_threads is set to -1. Feedback suggests enhancing the robustness of the CPU detection by handling potential OSError exceptions in restricted environments and ensuring a minimum return value of 1.
| try: | ||
| return len(os.sched_getaffinity(0)) | ||
| except AttributeError: | ||
| return mp.cpu_count() |
There was a problem hiding this comment.
It is recommended to catch OSError in addition to AttributeError. In some restricted or containerized environments, os.sched_getaffinity might be present in the os module but blocked by security policies (e.g., seccomp), which would raise an OSError. Additionally, ensuring the returned value is at least 1 is a good defensive practice for downstream tools like Numba that require a positive thread count.
| try: | |
| return len(os.sched_getaffinity(0)) | |
| except AttributeError: | |
| return mp.cpu_count() | |
| try: | |
| return max(1, len(os.sched_getaffinity(0))) | |
| except (AttributeError, OSError): | |
| return max(1, mp.cpu_count()) |
There was a problem hiding this comment.
I think that's overengineering and doesn't really add much benefit for four new lines of code.
|
Should we bump the version for this change, or just roll it in with the next version bump? |
Since previous releases also mostly consisted of a single PR, I’ll also publish a cell-eval 0.7.2 release with this one. |
Summary
MetricsEvaluatorcrashed under SLURM (--cpus-per-task=N) when the host had more CPUs than the job allocation, becausemp.cpu_count()returns host-wide CPUs and ignores cgroup/cpuset limits.The oversized thread count was forwarded to
numba.set_num_threads()viapdex, raisingValueError: The number of threads must be between 1 and N.mp.cpu_count()with an affinity-aware_available_cpus()helper usingos.sched_getaffinity(0), with amp.cpu_count()fallback for macOS/Windows (those platforms don't run SLURM and have no cgroup caps in practice).num_threads == -1sentinel at the top ofMetricsEvaluator.__init__so the resolved value is used everywhere downstream rather than only at the_build_de_comparisoncall site.os.sched_getaffinityis stdlib and matches the same source numba uses internally to size its threadpool.